This script takes a deep dive into Sentinel 2 labels for a more rigorous analysis of inconsistent band data and outliers in the filtered label dataset. Here we determine whether any more label data points should be removed from the training dataset, and whether the metadata in the outlier dataset can help us pre-emptively toss out scenes when we go to apply the classification algorithm.
harmonize_version = "2024-04-25"
outlier_version = "2024-04-25"
S2 <- read_rds(paste0("data/labels/harmonized_SEN2_labels_", harmonize_version, ".RDS"))
First, let's just look at the data to check for consistency between the user-pulled data and our re-pull; here, the user data are in "BX" format and the re-pull is in "SR_BX" format. These steps assure data quality in case a volunteer didn't follow the directions exactly.
pmap(.l = list(user_band = S2_user,
ee_band = S2_ee,
data = list(S2),
mission = list("SENTINEL_2")),
.f = make_band_comp_plot)
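`make_band_comp_plot()` is defined elsewhere in the project; as a rough sketch (assumed, not the actual implementation), a comparison plot like this scatters the user-pulled band against the re-pulled band with a 1:1 line:

```r
library(ggplot2)

# Hypothetical sketch of make_band_comp_plot(): scatter the user-pulled band
# against the re-pulled band, with a red 1:1 line for reference. Points off
# the line are inconsistencies between the two pulls.
make_band_comp_plot <- function(user_band, ee_band, data, mission) {
  ggplot(data, aes(x = .data[[user_band]], y = .data[[ee_band]])) +
    geom_point(alpha = 0.3) +
    geom_abline(slope = 1, intercept = 0, color = "red") +  # 1:1 line
    labs(title = paste(mission, user_band, "vs", ee_band)) +
    theme_bw()
}
```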
There are a handful of fliers here; we'll compare the user-pulled and re-pulled data for consistency.
S2_inconsistent <- S2 %>%
filter(B2 != SR_B2 | B3 != SR_B3 |
B4 != SR_B4 | B5 != SR_B5 | B6 != SR_B6 | B7 != SR_B7 |
B11 != SR_B11 | B12 != SR_B12)
S2_inconsistent %>%
group_by(class) %>%
summarise(n_labels = n()) %>%
kable()
| class | n_labels |
|---|---|
| algalBloom | 2 |
| cloud | 53 |
| darkNearShoreSediment | 3 |
| lightNearShoreSediment | 13 |
| offShoreSediment | 20 |
| openWater | 40 |
| other | 4 |
| shorelineContamination | 2 |
We should see if there is any common contributor or scene here. There were no data handling differences between the user-pull and our second pull, so we can’t dismiss the cloud differences.
S2_inconsistent %>%
group_by(vol_init) %>%
summarise(n_dates = length(unique(date)),
n_labels = n()) %>%
kable()
| vol_init | n_dates | n_labels |
|---|---|---|
| ANK | 7 | 24 |
| BJM | 9 | 51 |
| JFL | 2 | 6 |
| LAE | 1 | 7 |
| LRCP | 3 | 37 |
| SKS | 2 | 9 |
| TJB | 1 | 3 |
The inconsistencies are spread across many dates, so I don’t think there is any special handling necessary here except for removing these from the dataset.
S2_filtered <- S2 %>%
anti_join(S2_inconsistent) %>%
filter(# drop rows where any re-pulled band value is greater than 1, which isn't a valid value
if_all(S2_ee,
~ . <= 1))
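Note that `anti_join()` without a `by` argument matches on all shared columns, so every label row that also appears in `S2_inconsistent` is dropped. A toy illustration of that behavior:

```r
library(dplyr)

# Rows of `a` that also appear in `bad` (matched on both columns) are dropped.
a   <- tibble(id = 1:4, x = c(10, 20, 30, 40))
bad <- tibble(id = c(2, 4), x = c(20, 40))
kept <- anti_join(a, bad, by = c("id", "x"))
kept$id  # rows 1 and 3 remain
```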
And plot:
And now let’s look at the data by class:
We aren’t actually modeling “other” (not sufficient observations to classify) or “shorelineContamination” (we’ll use this later to block areas where there is likely shoreline contamination in the AOI). Additionally, “algalBloom” is likely a processing issue, and n is too low to use that data, so we’ll drop that class as well. Finally, we’ll drop the user band values since we’re done with comparisons.
S2_for_class_analysis <- S2_filtered %>%
filter(!(class %in% c("other", "shorelineContamination", "algalBloom"))) %>%
select(-c(B2:B12))
There are statistical outliers within this dataset, and they may impact the interpretation of any statistical testing we do. Let's see if we can narrow down where those outliers occur and/or glean anything from the outlier data that may be applicable to the application of the algorithm. Outliers may be a systemic issue (the entire scene is an outlier), a user issue (a volunteer may have been a bad actor), or they might just be real. This section asks those questions. The "true outliers" that we dismiss from the dataset will also be used to help aid in interpretation/application of the algorithm across the Landsat stack, so it is important to make notes of any patterns we might see in the outlier dataset.
## [1] "Classes represented in outliers:"
## [1] "darkNearShoreSediment" "lightNearShoreSediment" "offShoreSediment"
## [4] "openWater"
Okay, 24 outliers (>1.5*IQR) out of 999 - and they are all from non-cloud groups.
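The 1.5 × IQR rule used to flag these outliers can be sketched as follows (this mirrors the standard definition; the actual flagging code lives upstream of this script):

```r
# A value is an outlier if it falls more than 1.5 * IQR outside the quartiles.
is_iqr_outlier <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75), na.rm = TRUE))
  iqr <- q[2] - q[1]
  x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
}

is_iqr_outlier(c(1, 2, 3, 4, 100))  # only 100 is flagged
```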
How many of these outliers are in specific scenes?
S2_out_date <- outliers %>%
group_by(date, vol_init) %>%
summarize(n_out = n())
S2_date <- S2_for_class_analysis %>%
filter(class != "cloud") %>%
group_by(date, vol_init) %>%
summarise(n_tot = n())
S2_out_date <- left_join(S2_out_date, S2_date) %>%
mutate(percent_outlier = n_out/n_tot*100) %>%
arrange(-percent_outlier)
S2_out_date %>%
kable()
| date | vol_init | n_out | n_tot | percent_outlier |
|---|---|---|---|---|
| 2022-06-23 | BJM | 6 | 24 | 25.000000 |
| 2023-04-11 | BJM | 6 | 30 | 20.000000 |
| 2019-06-06 | LRCP | 9 | 148 | 6.081081 |
| 2020-09-28 | BJM | 1 | 32 | 3.125000 |
| 2021-08-29 | ANK | 1 | 62 | 1.612903 |
| 2019-05-12 | LRCP | 1 | 86 | 1.162791 |
There are two images with higher outlier percentages; let's look at them quickly:
2022-06-23
There’s some weird stuff happening here - while this does look like some near shore sediment, I think there are also some aerosol issues with this scene. If you zoom in on other waterbodies on this date, they have a similar grey-brown sheen to them, and if you zoom in on this image, it doesn’t look… natural?
2023-04-11
This is a really cool looking scene. Lots of things going on here: 1) the end of ice break up 2) super hazy/cirrus clouds 3) snow on the ground might be affecting the SR processes. We should probably ditch this scene, but we’ll do that at the end.
Let’s look at the tile-level metadata for the highest scenes here:
S2_out_date %>%
filter(percent_outlier >= 20) %>%
select(date, vol_init) %>%
left_join(., S2) %>%
select(date, vol_init, GENERAL_QUALITY:RADIOMETRIC_QUALITY,
SNOW_ICE_PERCENTAGE_mean,
SNOW_ICE_PERCENTAGE_max,
SNOW_ICE_PERCENTAGE_min) %>%
distinct() %>%
kable()
| date | vol_init | GENERAL_QUALITY | GEOMETRIC_QUALITY | RADIOMETRIC_QUALITY | SNOW_ICE_PERCENTAGE_mean | SNOW_ICE_PERCENTAGE_max | SNOW_ICE_PERCENTAGE_min |
|---|---|---|---|---|---|---|---|
| 2022-06-23 | BJM | PASSED | PASSED | PASSED | 0.000000 | 0.00000 | 0.000000 |
| 2023-04-11 | BJM | PASSED | PASSED | PASSED | 9.856787 | 28.13553 | 0.000361 |
Some detection of snow/ice in the April image, but since these are aggregated values across multiple tiles, hard to tell what to glean from this.
How many of these outliers have tile-level cloud metadata with high values?
The outlier dataset contains 7 labels (29.2%) where the mean pixel cloud cover was > 60% and 6 (25%) where the mean pixel cirrus cover was > 20%. The filtered dataset contains 97 (9.7%) where the mean pixel cloud cover was > 60% and 47 (4.7%) where the mean pixel cirrus cover was > 20%. This is a little more helpful than in the Landsat images: high pixel cloud cover and cirrus cover are more prevalent among the outliers than in the filtered dataset.
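The shares quoted above can be computed with a small helper along these lines (`pct_above` is hypothetical, not part of the project code, and the cloud-cover column name below is an assumption):

```r
# Percentage of rows in `df` where `col` exceeds `threshold`.
pct_above <- function(df, col, threshold) {
  round(100 * mean(df[[col]] > threshold, na.rm = TRUE), 1)
}

# e.g. pct_above(outliers, "cloud_cover_mean", 60)  # column name assumed
toy <- data.frame(cloud_cover_mean = c(10, 70, 80, 50))
pct_above(toy, "cloud_cover_mean", 60)  # 50
```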
Do any of the labels have QA pixel indications of opaque cloud or cirrus?
S2_for_class_analysis %>%
mutate(QA = case_when(cirrus == 1 | cirrus_scl == 1 ~ "cirrus",
hi_prob_cloud == 1 | med_prob_cloud == 1 ~ "cloud",
snow_ice == 1 ~ "snow ice",
opaque == 1 ~ "opaque cloud",
cloud_shadow == 1 ~ "cloud shadow",
dark_pixel == 1 ~ "dark pix",
TRUE ~ "clear")) %>%
group_by(QA) %>%
filter(class != "cloud") %>%
summarize(n_tot = n()) %>%
kable()
| QA | n_tot |
|---|---|
| cirrus | 14 |
| clear | 494 |
| opaque cloud | 10 |
Let’s look at the cirrus and opaque cloud group to see if there is anything egregious:
S2_tot <- S2_for_class_analysis %>%
group_by(date, vol_init) %>%
summarise(n_tot_labels = n())
S2_for_class_analysis %>%
filter(cirrus == 1 | cirrus_scl == 1 | opaque == 1,
class != "cloud") %>%
group_by(date, vol_init) %>%
summarise(n_qa_flag = n()) %>%
left_join(S2_tot) %>%
mutate(perc_qa_flag = round(n_qa_flag/n_tot_labels*100, 1)) %>%
arrange(-perc_qa_flag) %>%
kable()
| date | vol_init | n_qa_flag | n_tot_labels | perc_qa_flag |
|---|---|---|---|---|
| 2020-10-26 | BJM | 9 | 37 | 24.3 |
| 2023-04-11 | BJM | 12 | 71 | 16.9 |
| 2020-09-28 | BJM | 2 | 103 | 1.9 |
| 2019-05-12 | LRCP | 1 | 109 | 0.9 |
The top two images here are the same as in the outlier analysis. I think we'll trust the QA bits to help with all this.
Sentinel 2 doesn't have an aerosol QA flag, but SR_B1 is the aerosol band, so we can examine its distributions instead. Let's look at the distribution across all data:
S2_for_class_analysis %>%
ggplot(., aes(x = SR_B1)) +
geom_histogram(binwidth = 0.01) +
facet_grid(class ~ .) +
theme_bw() +
scale_x_continuous(breaks = seq(0, max(S2_for_class_analysis$SR_B1), 0.05))
B1_summary <- S2_for_class_analysis %>%
group_by(class) %>%
summarise(fifth_B1 = quantile(SR_B1, 0.05),
mean_B1 = mean(SR_B1),
ninetyfifth_B1 = quantile(SR_B1, 0.95))
kable(B1_summary)
| class | fifth_B1 | mean_B1 | ninetyfifth_B1 |
|---|---|---|---|
| cloud | 0.168800 | 0.5004647 | 0.860100 |
| darkNearShoreSediment | 0.011015 | 0.0320797 | 0.048465 |
| lightNearShoreSediment | 0.033250 | 0.0546151 | 0.088225 |
| offShoreSediment | 0.027685 | 0.0422233 | 0.058135 |
| openWater | 0.017870 | 0.0293747 | 0.044230 |
Let's see how many of the outliers have Rrs values higher than the 95th percentile in their class:
outliers %>%
left_join(B1_summary) %>%
filter(SR_B1 > ninetyfifth_B1, class != "cloud") %>%
nrow()
## [1] 13
This is about 50% of the outliers, so a pretty decent QA measure.
For the purposes of training data, we are going to throw out any labels that have QA flags associated with them:
S2_training_labels <- S2_for_class_analysis %>%
filter(if_all(c("cirrus", "cirrus_scl", "cloud_shadow",
"dark_pixel", "hi_prob_cloud", "med_prob_cloud",
"opaque", "snow_ice"),
~ . == 0) |
class == "cloud") %>%
filter(!(date %in% c("2022-06-23", "2023-04-11")))
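The `if_all()` condition keeps a row only when every listed flag is zero, and the `| class == "cloud"` branch exempts cloud labels (which are expected to trip the cloud flags). A toy check of that logic:

```r
library(dplyr)

# Toy stand-in for the label data: one clean label, one cloud label,
# and one non-cloud label with a cirrus flag set.
toy <- tibble(class  = c("openWater", "cloud", "openWater"),
              cirrus = c(0, 1, 1),
              opaque = c(0, 0, 0))

toy %>%
  filter(if_all(c("cirrus", "opaque"), ~ . == 0) | class == "cloud")
# keeps the clean openWater row and the cloud row; drops the flagged openWater row
```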
Let’s now see how many of these have high percentile SR_B1:
S2_training_labels %>%
left_join(B1_summary) %>%
filter(SR_B1 > ninetyfifth_B1, class != "cloud") %>%
nrow()
## [1] 14
And make sure none of them are concentrated in a specific scene:
S2_training_labels %>%
left_join(B1_summary) %>%
filter(SR_B1 > ninetyfifth_B1, class != "cloud") %>%
group_by(date, vol_init) %>%
summarise(n_high_aero = n())
## # A tibble: 6 × 3
## # Groups: date [6]
## date vol_init n_high_aero
## <date> <chr> <int>
## 1 2019-05-12 LRCP 2
## 2 2019-06-06 LRCP 3
## 3 2020-09-28 BJM 1
## 4 2020-10-26 BJM 2
## 5 2020-11-30 BJM 1
## 6 2022-11-30 BJM 5
This seems totally reasonable.
We do want to have an idea of how different the classes are, in regards to band data. While there are a bunch of interactions that we could get into here, for the sake of this analysis, we are going to analyze the class differences by band.
Kruskal-Wallis assumptions:

- observations in each group are independent
- the response variable is ordinal or continuous
- to compare medians (rather than distributions generally), groups should have similarly shaped distributions

ANOVA assumptions:

- observations are independent
- residuals are approximately normally distributed
- variance is roughly equal across groups
We can't entirely assert sample independence, and we know that variance and distribution differ for "cloud" labels, but those data are also visibly different from the other classes.
In order to systematically test for differences between classes and be able to interpret the data, we will need to know some things about our data.
With this workflow, most classes are statistically different - below are the cases where the pairwise comparisons were not deemed statistically significant:
| band | group1 | group2 | n1 | n2 | statistic | p | p.adj | p.adj.signif |
|---|---|---|---|---|---|---|---|---|
| SR_B2 | darkNearShoreSediment | offShoreSediment | 64 | 117 | 2.7190570 | 0.0065468 | 0.0654683 | ns |
| SR_B2 | darkNearShoreSediment | openWater | 64 | 154 | -1.6152259 | 0.1062618 | 1.0000000 | ns |
| SR_B2 | lightNearShoreSediment | offShoreSediment | 102 | 117 | -2.6802513 | 0.0073567 | 0.0735669 | ns |
| SR_B3 | darkNearShoreSediment | offShoreSediment | 64 | 117 | 1.2549300 | 0.2095042 | 1.0000000 | ns |
| SR_B4 | darkNearShoreSediment | lightNearShoreSediment | 64 | 102 | 2.4622904 | 0.0138053 | 0.1380529 | ns |
| SR_B4 | darkNearShoreSediment | offShoreSediment | 64 | 117 | -1.6410142 | 0.1007945 | 1.0000000 | ns |
| SR_B5 | darkNearShoreSediment | lightNearShoreSediment | 64 | 102 | 2.0770150 | 0.0378002 | 0.3780018 | ns |
| SR_B5 | darkNearShoreSediment | offShoreSediment | 64 | 117 | -2.1130300 | 0.0345982 | 0.3459820 | ns |
| SR_B6 | darkNearShoreSediment | lightNearShoreSediment | 64 | 102 | 1.4614875 | 0.1438817 | 1.0000000 | ns |
| SR_B6 | darkNearShoreSediment | offShoreSediment | 64 | 117 | -1.9932388 | 0.0462353 | 0.4623530 | ns |
| SR_B7 | darkNearShoreSediment | lightNearShoreSediment | 64 | 102 | 1.3916638 | 0.1640242 | 1.0000000 | ns |
| SR_B7 | darkNearShoreSediment | offShoreSediment | 64 | 117 | -1.9394492 | 0.0524467 | 0.5244667 | ns |
| SR_B8 | darkNearShoreSediment | lightNearShoreSediment | 64 | 102 | 1.3596677 | 0.1739351 | 1.0000000 | ns |
| SR_B8 | darkNearShoreSediment | offShoreSediment | 64 | 117 | -1.9968907 | 0.0458371 | 0.4583706 | ns |
| SR_B8A | darkNearShoreSediment | lightNearShoreSediment | 64 | 102 | 1.2685325 | 0.2046079 | 1.0000000 | ns |
| SR_B8A | darkNearShoreSediment | offShoreSediment | 64 | 117 | -1.6112531 | 0.1071246 | 1.0000000 | ns |
| SR_B11 | darkNearShoreSediment | lightNearShoreSediment | 64 | 102 | 0.5292842 | 0.5966083 | 1.0000000 | ns |
| SR_B11 | darkNearShoreSediment | offShoreSediment | 64 | 117 | -1.6296461 | 0.1031763 | 1.0000000 | ns |
| SR_B11 | lightNearShoreSediment | offShoreSediment | 102 | 117 | -2.4933917 | 0.0126529 | 0.1265292 | ns |
| SR_B11 | offShoreSediment | openWater | 117 | 154 | -2.4124536 | 0.0158456 | 0.1584555 | ns |
| SR_B12 | darkNearShoreSediment | lightNearShoreSediment | 64 | 102 | 0.4650300 | 0.6419100 | 1.0000000 | ns |
| SR_B12 | darkNearShoreSediment | offShoreSediment | 64 | 117 | -1.4404613 | 0.1497369 | 1.0000000 | ns |
| SR_B12 | lightNearShoreSediment | offShoreSediment | 102 | 117 | -2.2006275 | 0.0277624 | 0.2776241 | ns |
| SR_B12 | offShoreSediment | openWater | 117 | 154 | -1.8261056 | 0.0678344 | 0.6783435 | ns |
Eep, this isn’t great, but the pairwise confusion seems concentrated in dark near shore sediment.
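A base-R sketch of this testing pattern on invented toy data (the project's actual workflow uses helpers not shown in this section; the class means below are made up for illustration):

```r
set.seed(42)
toy <- data.frame(
  class = rep(c("openWater", "offShoreSediment", "darkNearShoreSediment"),
              each = 30),
  SR_B2 = c(rnorm(30, 0.020, 0.005),   # invented class means for illustration
            rnorm(30, 0.040, 0.005),
            rnorm(30, 0.035, 0.005))
)

# Global Kruskal-Wallis test for any difference across classes...
kw <- kruskal.test(SR_B2 ~ class, data = toy)
# ...then pairwise post-hoc comparisons with a multiplicity adjustment.
pw <- pairwise.wilcox.test(toy$SR_B2, toy$class, p.adjust.method = "bonferroni")
kw$p.value  # well below 0.05 for this toy data
```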
Let’s look at the boxplots of the non-cloud classes:
S2_training_labels_no_clouds <- S2_training_labels %>%
filter(class != "cloud")
pmap(.l = list(data = list(S2_training_labels_no_clouds),
data_name = list("SENTINEL_2"),
band = S2_ee_all),
.f = make_class_comp_plot)
This actually makes me feel a little better. I think there might be enough here for ML to differentiate between classes.
There are in SR_B1, 0 in SR_B2, 0 in SR_B3, 1 in SR_B4, 1 in SR_B5, 1 outliers in SR_B6, 1 in SR_B7, 1 in SR_B8, 0 in SR_B8A, 10 in SR_B11, and 13 in SR_B12.
DNSS: dark near shore sediment, LNSS: light near shore sediment, OSS: offshore sediment
Let’s zoom in on the sediment classes.
DNSS: dark near shore sediment, LNSS: light near shore sediment, OSS: offshore sediment
Things to note for Sentinel 2:
write_rds(S2_training_labels, paste0("data/labels/S2_labels_for_tvt_", outlier_version, ".RDS"))